NSF PAR Search | NSF Public Access Repository

Scaling Optimization over Uncertainty via Compilation

https://doi.org/10.1145/3720500

Cho, Minsung; Gouwar, John; Holtzen, Steven (April 2025, Proceedings of the ACM on Programming Languages)

Probabilistic inference is fundamentally hard, yet many tasks require optimization on top of inference, which is even harder. We present a new optimization-via-compilation strategy to scalably solve a certain class of such problems. In particular, we introduce a new intermediate representation (IR), binary decision diagrams weighted by a novel notion of branch-and-bound semiring, that enables a scalable branch-and-bound based optimization procedure. This IR automatically factorizes problems through program structure and prunes suboptimal values via a straightforward branch-and-bound style algorithm to find optima. Additionally, the IR is naturally amenable to staged compilation, allowing the programmer to query for optima mid-compilation to inform further executions of the program. We showcase the effectiveness and flexibility of the IR by implementing two performant languages that both compile to it: dappl and pineappl. dappl is a functional language that solves maximum expected utility problems with first-class support for rewards, decision making, and conditioning. pineappl is an imperative language that performs exact probabilistic inference with support for nested marginal maximum a posteriori (MMAP) optimization via staging.
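To make the branch-and-bound idea in this abstract concrete, here is a minimal Python sketch of pruning in a toy maximum expected utility problem. It is not the paper's IR, its semiring, or the dappl language: the staged-actions-with-a-budget problem shape and the names `expected_reward` and `max_expected_utility` are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical toy model: an action is (cost, [(probability, reward), ...]).
Action = Tuple[int, List[Tuple[float, float]]]


def expected_reward(action: Action) -> float:
    """Expected reward of one action under its outcome distribution."""
    return sum(p * r for p, r in action[1])


def max_expected_utility(stages: List[List[Action]], budget: int) -> Tuple[float, List[int]]:
    """Pick one action per stage, within budget, maximizing total expected reward."""
    # Optimistic bound: best expected reward of every remaining stage, ignoring cost.
    best_rest = [0.0] * (len(stages) + 1)
    for i in range(len(stages) - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(expected_reward(a) for a in stages[i])

    best_value = float("-inf")
    best_choice: List[int] = []

    def search(i: int, value: float, remaining: int, chosen: List[int]) -> None:
        nonlocal best_value, best_choice
        # Prune: even the optimistic completion cannot beat the incumbent.
        if value + best_rest[i] <= best_value:
            return
        if i == len(stages):
            best_value, best_choice = value, list(chosen)
            return
        for k, action in enumerate(stages[i]):
            if action[0] <= remaining:  # respect the budget
                chosen.append(k)
                search(i + 1, value + expected_reward(action), remaining - action[0], chosen)
                chosen.pop()

    search(0, 0.0, budget, [])
    return best_value, best_choice


if __name__ == "__main__":
    stages = [
        [(0, [(1.0, 0.0)]), (2, [(0.5, 10.0), (0.5, 0.0)])],
        [(0, [(1.0, 1.0)]), (2, [(0.8, 5.0), (0.2, 0.0)])],
    ]
    print(max_expected_utility(stages, budget=2))
```

The bound here is deliberately crude (it ignores the budget). Per the abstract, the paper instead obtains its pruning from bounds carried by the weighted decision diagram.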
Full Text Available
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

https://doi.org/10.1145/3689735

Cassano, Federico; Gouwar, John; Lucchetti, Francesca; Schlesinger, Claire; Freeman, Anders; Anderson, Carolyn Jane; Feldman, Molly Q; Greenberg, Michael; Jangda, Abhinav; Guha, Arjun (October 2024, Proceedings of the ACM on Programming Languages)

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter out obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language.
Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
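The validation idea in step (3) of this abstract, keeping only the candidate translations that pass the translated tests, can be sketched as follows. This is an illustrative simplification and not the MultiPL-T implementation: candidates and tests are Python source strings run in-process, whereas the real pipeline would run target-language programs in a sandbox. The names `passes_tests` and `filter_translations` are hypothetical.

```python
from typing import List


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and every assertion in test_src holds."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run the assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all reject the candidate.
        return False
    return True


def filter_translations(candidates: List[str], test_src: str) -> List[str]:
    """Keep only the candidates validated by the test cases."""
    return [c for c in candidates if passes_tests(c, test_src)]


if __name__ == "__main__":
    candidates = [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b):\n    return a - b\n",  # a wrong "translation"
    ]
    tests = "assert add(1, 2) == 3\nassert add(0, 0) == 0\n"
    print(len(filter_translations(candidates, tests)))
```

This sketch has no timeout or isolation, so a non-terminating or malicious candidate would not be handled. A practical filter would run each candidate in a separate, resource-limited process.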
Full Text Available
Cryptographic hardness under projections for time-bounded Kolmogorov complexity

https://doi.org/10.1016/j.tcs.2022.10.040

Allender, Eric; Gouwar, John; Hirahara, Shuichi; Robelle, Caleb (November 2022, Theoretical Computer Science)

Full Text Available
Cryptographic Hardness Under Projections for Time-Bounded Kolmogorov Complexity

https://doi.org/10.4230/LIPIcs.ISAAC.2021.54

Allender, Eric; Gouwar, John; Hirahara, Shuichi; Robelle, Caleb (November 2021, Leibniz international proceedings in informatics)
Ahn, Hee-Kap; Sadakane, Kunihiko (Ed.)
A version of time-bounded Kolmogorov complexity, denoted KT, has received attention in the past several years, due to its close connection to circuit complexity and to the Minimum Circuit Size Problem MCSP. Essentially all results about the complexity of MCSP hold also for MKTP (the problem of computing the KT complexity of a string). Both MKTP and MCSP are hard for SZK (Statistical Zero Knowledge) under BPP-Turing reductions; neither is known to be NP-complete. Recently, some hardness results for MKTP were proved that are not (yet) known to hold for MCSP. In particular, MKTP is hard for DET (a subclass of P) under nonuniform ≤^{NC^0}_m reductions. In this paper, we improve this, to show that the complement of MKTP is hard for the (apparently larger) class NISZK_L under not only ≤^{NC^0}_m reductions but even under projections. Also, the complement of MKTP is hard for NISZK under ≤^{P/poly}_m reductions. Here, NISZK is the class of problems with non-interactive zero-knowledge proofs, and NISZK_L is the non-interactive version of the class SZK_L that was studied by Dvir et al. As an application, we provide several improved worst-case to average-case reductions to problems in NP, and we obtain a new lower bound on MKTP (which is currently not known to hold for MCSP).
Full Text Available
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

https://doi.org/10.1109/TSE.2023.3267446

Cassano, Federico; Gouwar, John; Nguyen, Daniel; Nguyen, Sydney; Phipps-Costin, Luna; Pinckney, Donald; Yee, Ming-Ho; Zi, Yangtian; Anderson, Carolyn Jane; Feldman, Molly Q; et al (April 2023, IEEE Transactions on Software Engineering)
Michael Pradel (Ed.)
Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark and MBPP benchmark to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen and InCoder. We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
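As a rough illustration of what translating a unit test to a new language involves, the Python sketch below turns one test case (function name, arguments, expected value) into a Lua assertion. It is not MultiPL-E's translator: the helper names `to_lua` and `translate_test`, the choice of Lua, and the restriction to scalar expected values are all assumptions for this example.

```python
import json
from typing import Any, Sequence


def to_lua(value: Any) -> str:
    """Render a simple Python value as a Lua literal (sketch; limited types only)."""
    if value is None:
        return "nil"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return repr(value)
    if isinstance(value, str):
        if not value.isascii():
            raise ValueError("sketch only handles ASCII strings")
        return json.dumps(value)  # double-quoted with backslash escapes
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(to_lua(v) for v in value) + "}"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def translate_test(func: str, args: Sequence[Any], expected: Any) -> str:
    """Emit a Lua assertion for one test case with a scalar expected value."""
    if isinstance(expected, (list, tuple)):
        # Lua's == compares tables by reference, so a deep-equality helper
        # would be needed; that is out of scope for this sketch.
        raise NotImplementedError("table-valued expectations need deep equality")
    call = f"{func}({', '.join(to_lua(a) for a in args)})"
    return f"assert({call} == {to_lua(expected)})"


if __name__ == "__main__":
    print(translate_test("total", ([1, 2, 3],), 6))
    print(translate_test("is_empty", ("",), True))
```

Only the test values and the assertion form are translated here. A full system along the lines the abstract describes would also have to translate the function signature and prompt for each target language.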
Full Text Available
Cryptographic Hardness under Projections for Time-Bounded Kolmogorov Complexity

Allender, Eric; Gouwar, John; Hirahara, Shuichi; Robelle, Caleb (February 2021, Electronic colloquium on computational complexity)
Full Text Available